Add support for running vLLM #799
base: main
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the review settings.
📝 Walkthrough

Adds a new vLLM workload: documentation, a vllm workload package (models, test definition, bench args), Slurm command-generation for aggregated/disaggregated runs, report parsing/renderer, registration wiring, reference SBATCH scripts, and tests for command generation and result parsing.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ Passed checks (2 passed)
Actionable comments posted: 15
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tests/test_test_scenario.py (1)
474-496: ⚠️ Potential issue | 🟠 Major

`VllmTestDefinition` is counted in `reports_map` (17) but not verified in `test_custom_reporters`. The registry size was bumped to 17 to account for the new vLLM reporter, but the parametrized `test_custom_reporters` list (Lines 480–496) still only has 16 entries and does not include `(VllmTestDefinition, {VLLMBenchReportGenerationStrategy})`. Add it to ensure the mapping is tested.

💚 Proposed fix:

```diff
     (NixlPerftestTestDefinition, {NIXLKVBenchDummyReport}),
     (AiconfiguratorTestDefinition, {AiconfiguratorReportGenerationStrategy}),
+    (VllmTestDefinition, {VLLMBenchReportGenerationStrategy}),
 ],
```

(Along with the corresponding imports at the top of the file.)
tests/test_init.py (1)
2-2: ⚠️ Potential issue | 🟡 Minor: Fix copyright year to satisfy CI.
The pipeline reports: "Copyright header year is invalid. Expected year range ending 2026 but found 2025."
```diff
-# Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```
🤖 Fix all issues with AI agents
In `@doc/workloads/vllm.rst`:
- Around line 86-94: The TOML snippet visually implies `[extra_env_vars]` is
nested under `[cmd_args.prefill]`; update the example by inserting a clear
separation (e.g., a blank line and/or a comment) between `[cmd_args.prefill]`
and `[extra_env_vars]`, or expand to show the surrounding test-definition
context so readers see that `[extra_env_vars]` is a sibling of `[cmd_args]`;
ensure the change references the existing section names `[cmd_args.prefill]` and
`[extra_env_vars]` so it’s obvious they are separate blocks.
- Line 81: Fix the grammatical error in the sentence about enabling
disaggregation: replace "one need to set ``prefill`` configuration" with "one
needs to set the ``prefill`` configuration" (update the sentence in
doc/workloads/vllm.rst that references the ``prefill`` configuration).
In `@src/cloudai/registration.py`:
- Line 260: The test registration currently adds the VllmTestDefinition under
the lowercase key "vllm", causing lookups for "Vllm" to fail; change the
registration call in Registry().add_test_definition(...) to use the PascalCase
key "Vllm" (i.e., Registry().add_test_definition("Vllm", VllmTestDefinition)) so
test_template_name lookups match existing usages of "Vllm".
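A minimal sketch of the corrected call; only the key string changes, and the `Registry` import stays whatever `registration.py` already uses (the import below is the assumed export path):

```python
from cloudai.workloads.vllm import VllmTestDefinition

# PascalCase key so existing test_template_name = "Vllm" lookups resolve
Registry().add_test_definition("Vllm", VllmTestDefinition)
```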
In `@src/cloudai/workloads/vllm/report_generation_strategy.py`:
- Around line 1-11: Add the standard SPDX and copyright header block at the very
top of this module (above all imports) to match the other new files in this PR:
include the lines starting with "# SPDX-FileCopyrightText: NVIDIA CORPORATION &
AFFILIATES" and the appropriate license identifier line (e.g., "#
SPDX-License-Identifier: Apache-2.0" or the license used in the repo). Ensure
the header appears before the existing imports and does not modify symbols such
as ReportGenerationStrategy or VLLM_BENCH_JSON_FILE so the rest of the file
(imports, BaseModel/ConfigDict usage, and references to VLLM_BENCH_JSON_FILE)
remain unchanged.
- Around line 63-64: Protect against division by zero when computing the percent
in the table.add_row call by checking results.num_prompts before performing
results.completed / results.num_prompts; if results.num_prompts is 0, substitute
a safe value or placeholder (e.g., "0.00% (0 of 0)" or "N/A") instead of
dividing. Update the code path around table.add_row in
report_generation_strategy.py (the block that formats f"{results.completed /
results.num_prompts * 100:.2f}% ({results.completed} of {results.num_prompts})")
to conditionally compute the percent only when results.num_prompts > 0 and
otherwise insert the chosen fallback string. Ensure the change uses the existing
results.completed and results.num_prompts symbols so callers and tests continue
to reference the same fields.
- Around line 31-40: The parse_vllm_bench_output function currently calls
json.load and VLLMBenchReport.model_validate without handling parsing/validation
errors; wrap the json.load and model_validate calls in a try/except that catches
json.JSONDecodeError and pydantic.ValidationError (and optionally generic
Exception) and returns None on failure so corrupt or unexpected files don't
crash report generation; keep the function signature and cache decorator
unchanged and reference both parse_vllm_bench_output and
VLLMBenchReport.model_validate when implementing the try/except.
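A combined sketch of the defensive parsing and the zero-guard requested in the two comments above. The `VLLMBenchReport` stub is trimmed to the two fields involved (the real model in this module carries the full benchmark schema), and `completed_cell` is a hypothetical helper name used only to isolate the formatting logic:

```python
import json
from pathlib import Path

from pydantic import BaseModel, ValidationError


class VLLMBenchReport(BaseModel):
    completed: int
    num_prompts: int
    # ...remaining benchmark fields omitted in this sketch


def parse_vllm_bench_output(res_file: Path) -> VLLMBenchReport | None:
    """Return parsed bench results, or None when the file is corrupt or unexpected."""
    try:
        with res_file.open() as fh:
            data = json.load(fh)
        return VLLMBenchReport.model_validate(data)
    except (json.JSONDecodeError, ValidationError):
        return None


def completed_cell(results: VLLMBenchReport) -> str:
    """Format the 'completed' column without dividing by zero."""
    if results.num_prompts > 0:
        pct = results.completed / results.num_prompts * 100
        return f"{pct:.2f}% ({results.completed} of {results.num_prompts})"
    return "N/A (0 prompts)"
```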
In `@src/cloudai/workloads/vllm/slurm_command_gen_strategy.py`:
- Around line 175-183: Both _gen_aggregated_script and _gen_disaggregated_script
declare the parameter output_path without a type; annotate output_path: Path to
match the codebase convention (e.g., def _gen_aggregated_script(...,
output_path: Path) -> str) and do the same for _gen_disaggregated_script; also
ensure Path is imported (from pathlib import Path) and update any references to
VllmCmdArgs or return typing if needed to keep type-checking happy.
- Around line 240-252: The decode-step exports (export
CUDA_VISIBLE_DEVICES="{decode_gpus}" and export
VLLM_NIXL_SIDE_CHANNEL_PORT=$DECODE_NIXL_PORT) leak into subsequent proxy and
benchmark srun steps; either scope the decode exports in a subshell around the
srun invocation (so the environment resets after the command), or add explicit
unsets (unset CUDA_VISIBLE_DEVICES; unset VLLM_NIXL_SIDE_CHANNEL_PORT)
immediately after DECODE_PID is set; locate the block around
PREFILL_PID/DECODE_PID and modify the decode section that constructs
decode_cmd/vllm-decode.log to apply one of these fixes.
- Around line 30-32: The methods image_path, get_vllm_serve_commands,
get_proxy_command, get_vllm_bench_command, generate_wait_for_health_function,
and _gen_srun_command duplicate casting of self.test_run.test to
VllmTestDefinition; replace those casts with the existing self.tdef property to
remove duplication and ensure consistent access to the VllmTestDefinition
instance (update image_path to return str(self.tdef.docker_image.installed_path)
and similarly use self.tdef inside each of the listed methods).
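For the decode-export leak flagged above (around lines 240-252), one option is to have the generator emit explicit `unset` lines right after `DECODE_PID` is captured. A sketch under the assumption that the script is built as a list of shell lines; `decode_gpus` and `decode_cmd` are the values referenced in the comment:

```python
def _gen_decode_block(decode_gpus: str, decode_cmd: str) -> str:
    return "\n".join(
        [
            f'export CUDA_VISIBLE_DEVICES="{decode_gpus}"',
            "export VLLM_NIXL_SIDE_CHANNEL_PORT=$DECODE_NIXL_PORT",
            f"{decode_cmd} > vllm-decode.log 2>&1 &",
            "DECODE_PID=$!",
            # Reset the decode-only environment so the later proxy/benchmark
            # srun steps (launched with --export=ALL) do not inherit it.
            "unset CUDA_VISIBLE_DEVICES",
            "unset VLLM_NIXL_SIDE_CHANNEL_PORT",
        ]
    )
```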
In `@src/cloudai/workloads/vllm/vllm.py`:
- Around line 106-115: The log parsing in was_run_successful is fragile: when
reading lines after the "============ Serving Benchmark Result ============"
marker the code assumes line.split()[2] exists and is an integer
(num_successful_requests); change this to defensively parse the "Successful
requests:" line in the was_run_successful function by first verifying the split
has at least three parts (or use a regex to extract the digits after "Successful
requests:"), and wrap the int conversion in a try/except to handle ValueError;
if parsing fails, skip that line and continue, and only return
JobStatusResult(is_successful=True) when a valid positive integer is obtained.
- Around line 36-42: The serve_args property currently stringifies all values
from self.model_dump(exclude={"gpu_ids"}) which produces invalid CLI forms for
None, bool, and list types; update serve_args to skip None values, treat
booleans as flags (include "--key" when True and omit when False), and serialize
list values into the expected vLLM CLI form (e.g., join items with commas or
repeat the flag as appropriate for your CLI) while still converting other
primitive types to strings and preserving the key-to-flag conversion
(k.replace('_','-')) so that serve_args emits valid arguments for vllm serve.
- Around line 79-80: The private attributes _docker_image and _hf_model should
be declared using Pydantic v2's PrivateAttr to be stable: import PrivateAttr
from pydantic and replace the bare annotated defaults (`_docker_image:
DockerImage | None = None`, `_hf_model: HFModel | None = None`) with
PrivateAttr(default=None) assignments (e.g., set `_docker_image =
PrivateAttr(default=None)` and `_hf_model = PrivateAttr(default=None)`) so the
vllm model class uses the supported private-attribute API.
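A sketch of the `serve_args` shape suggested above, returning a list of CLI tokens. The exclude set and the `_` to `-` flag convention come from the comment; comma-joining list values is only one possible convention, since some vLLM options may instead expect repeated flags or JSON:

```python
@property
def serve_args(self) -> list[str]:
    args: list[str] = []
    for key, value in self.model_dump(exclude={"gpu_ids"}).items():
        flag = f"--{key.replace('_', '-')}"
        if value is None:
            continue  # omit unset options entirely
        if isinstance(value, bool):
            if value:
                args.append(flag)  # booleans act as presence flags
            continue
        if isinstance(value, list):
            args.extend([flag, ",".join(str(item) for item in value)])
            continue
        args.extend([flag, str(value)])
    return args
```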
In
`@tests/job_status_retrieval_strategy/test_vllm_job_status_retrieval_strategy.py`:
- Line 64: Rename the test method currently named test_no_succesfull_requests to
test_no_successful_requests (fix the typo) in the test class so it follows
naming conventions; update any references or usages (e.g., test discovery or
calls) to the old name, and ensure the test decorator or fixtures (like base_tr:
TestRun) remain unchanged—look for the function definition def
test_no_succesfull_requests(...) and change its identifier to def
test_no_successful_requests(...).
In `@tests/ref_data/vllm-disagg.sbatch`:
- Around line 48-59: The decode srun exports CUDA_VISIBLE_DEVICES and
VLLM_NIXL_SIDE_CHANNEL_PORT into the environment (see the second vllm serve
launch and PREFILL_PID usage), and because later srun calls use --export=ALL
those values leak into proxy/benchmark steps; after launching the decode srun
(the vllm serve --port 8200 --kv-transfer-config ... & block that sets
PREFILL_PID/DECODE), explicitly unset or reset CUDA_VISIBLE_DEVICES and
VLLM_NIXL_SIDE_CHANNEL_PORT (or set them to empty/default) so subsequent srun
commands don’t inherit them, ensuring only the intended vllm processes see those
vars.
In `@tests/test_test_scenario.py`:
- Around line 474-475: Update the copyright header in
tests/test_test_scenario.py to end with 2026 (change the year range on the
header line from 2024-2025 to 2024-2026); open the file, edit the top-of-file
copyright comment (near the file header, not in the test body or the
test_default_reporters_size function), save and run the CI check to confirm the
header now matches 2024-2026.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/test_test_scenario.py (1)
474-499: ⚠️ Potential issue | 🟡 Minor

Missing `VllmTestDefinition` entry in `test_custom_reporters`. The count on Line 475 was bumped to 17 to account for the new vLLM reporter, but the parametrized `test_custom_reporters` (lines 477–496) still only covers 16 entries — it does not include a `(VllmTestDefinition, {VLLMBenchReportGenerationStrategy})` case. This means the vLLM reporter registration is never actually verified.

🧪 Proposed fix — add the missing parametrize entry:

```diff
     (NixlPerftestTestDefinition, {NIXLKVBenchDummyReport}),
     (AiconfiguratorTestDefinition, {AiconfiguratorReportGenerationStrategy}),
+    (VllmTestDefinition, {VLLMBenchReportGenerationStrategy}),
 ],
```

This will also require adding the corresponding imports at the top of the file:

```python
from cloudai.workloads.vllm import VLLMBenchReportGenerationStrategy, VllmTestDefinition
```
Actionable comments posted: 6
🤖 Fix all issues with AI agents
In `@doc/workloads/vllm.rst`:
- Around line 92-93: The disaggregated TOML snippet is ambiguous because
[extra_env_vars] is shown at top level without context; update the example in
vllm.rst to show a complete TOML block including the top-level keys (e.g., name,
test_template_name, executor/test settings) and explicitly show whether
[extra_env_vars] is a sibling of [cmd_args] or nested under it (for example,
include [cmd_args] with its keys then a separate [extra_env_vars] section), so
readers can unambiguously see the intended section hierarchy and placement of
CUDA_VISIBLE_DEVICES.
- Line 73: Replace the awkward phrase "from less priority to more priority" in
the sentence "The number of GPUs can be controlled using the options below,
listed from less priority to more priority:" with a clearer alternative such as
"from lowest to highest priority" or "in order of increasing priority" so the
sentence reads e.g. "The number of GPUs can be controlled using the options
below, listed from lowest to highest priority:"; update the string where that
sentence appears in the vllm.rst documentation.
In `@src/cloudai/workloads/vllm/report_generation_strategy.py`:
- Around line 47-48: The use of functools.cache on parse_vllm_bench_output
causes indefinite memoization by Path and can return stale results if the file
changes; update the function to either remove the `@cache` decorator or change the
cache key to include the file's modification state (e.g., use an explicit
memoization keyed by (res_file, res_file.stat().st_mtime) or a TTL/lru cache) so
cached entries are invalidated when the file is updated; locate
parse_vllm_bench_output and replace the `@cache` usage with one of these
strategies to ensure fresh results for changed files.
- Around line 53-58: The except clause in the block that opens res_file and
calls VLLMBenchReport.model_validate(data) is redundant because
json.JSONDecodeError is already an Exception; update the except clause from
"except (json.JSONDecodeError, Exception):" to a single "except Exception:" so
it no longer lists duplicate exception types while preserving the current error
handling behavior.
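One way to keep the memoization while addressing the staleness comment above is to fold the file's mtime into the cache key, so a rewritten file gets re-parsed. A sketch, with `VLLMBenchReport` standing in for the model defined in this module and `_parse_cached` as a hypothetical helper name:

```python
import json
from functools import lru_cache
from pathlib import Path

from pydantic import ValidationError


def parse_vllm_bench_output(res_file: Path) -> "VLLMBenchReport | None":
    """Public entry point: resolve the path and key the cache on (path, mtime)."""
    res_file = res_file.resolve()
    try:
        mtime_ns = res_file.stat().st_mtime_ns
    except OSError:
        return None
    return _parse_cached(res_file, mtime_ns)


@lru_cache(maxsize=32)
def _parse_cached(res_file: Path, mtime_ns: int) -> "VLLMBenchReport | None":
    # mtime_ns is unused in the body; it only widens the cache key so that
    # an updated file invalidates the previously cached result.
    try:
        return VLLMBenchReport.model_validate(json.loads(res_file.read_text()))
    except (json.JSONDecodeError, ValidationError):
        return None
```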
In `@src/cloudai/workloads/vllm/slurm_command_gen_strategy.py`:
- Around line 258-268: The script launches the proxy in background (proxy_cmd,
PROXY_PID) and immediately starts the benchmark (bench_cmd), causing potential
failures if the proxy isn't ready; update the generated shell to wait for proxy
readiness by invoking the existing wait_for_health helper (or a short sleep)
against the proxy endpoint after starting the proxy and before running
bench_cmd, ensuring the health check references the same proxy port/URL used by
proxy_cmd and still retains PROXY_PID handling.
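For the proxy-readiness comment above, a sketch of a generated wait loop that could run between starting the proxy and launching `bench_cmd`; the `/health` endpoint path and the timeout are assumptions, and `proxy_port` must match whatever port `proxy_cmd` binds:

```python
def gen_wait_for_proxy(proxy_port: int, timeout_s: int = 600) -> str:
    """Emit shell lines that poll the proxy before the benchmark step starts."""
    return "\n".join(
        [
            f"for _ in $(seq {timeout_s}); do",
            f'  if curl -sf "http://localhost:{proxy_port}/health" > /dev/null; then break; fi',
            "  sleep 1",
            "done",
        ]
    )
```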
In `@tests/slurm_command_gen_strategy/test_vllm_slurm_command_gen_strategy.py`:
- Around line 55-60: The fixture vllm_disagg_tr mutates the shared vllm fixture;
instead create a fresh VllmTestDefinition instance (or deep copy the existing
vllm) inside vllm_disagg_tr, set its extra_env_vars =
{"CUDA_VISIBLE_DEVICES":"0,1,2,3"} and its cmd_args.prefill = VllmArgs() on that
new instance, then pass the new instance to TestRun(test=...) so vllm remains
unchanged; reference the vllm_disagg_tr fixture, VllmTestDefinition (or use
copy.deepcopy(vllm)), TestRun, and VllmArgs when making the change.
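A sketch of the fixture change described above. The import line and the fixture name `vllm_disagg_tdef` are assumptions; the remaining `TestRun` arguments stay exactly as the current fixture builds them:

```python
import copy

import pytest

from cloudai.workloads.vllm import VllmArgs, VllmTestDefinition  # exact exports assumed


@pytest.fixture
def vllm_disagg_tdef(vllm: VllmTestDefinition) -> VllmTestDefinition:
    """Disaggregated copy so the shared `vllm` fixture is never mutated."""
    disagg = copy.deepcopy(vllm)
    disagg.extra_env_vars = {"CUDA_VISIBLE_DEVICES": "0,1,2,3"}
    disagg.cmd_args.prefill = VllmArgs()
    return disagg  # hand this to TestRun(test=...) inside vllm_disagg_tr
```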
Greptile Overview

Greptile Summary: This PR adds a new vLLM workload. … Main issues to address before merge are in the vLLM Slurm command generation: (1) bench “extra args” are currently wired through …

Confidence Score: 3/5
Important Files Changed
14 files reviewed, 2 comments
Additional Comments (2)
Consider returning …
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/cloudai/workloads/vllm/report_generation_strategy.py`:
- Around line 31-45: VLLMBenchReport defines std_ttft_ms and std_tpot_ms but
they aren't shown in the generated report table; either remove these fields or
add them to the displayed metrics—update the report-generation code that
currently renders mean_ttft_ms/median_ttft_ms/p99_ttft_ms and
mean_tpot_ms/median_tpot_ms/p99_tpot_ms to also include std_ttft_ms and
std_tpot_ms (add headers, column values and formatting consistent with the other
stats), or delete std_ttft_ms/std_tpot_ms from VLLMBenchReport if intentionally
unused; ensure any serialization/deserialization and tests reference the updated
schema.
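If the choice is to display the fields, a sketch of the two extra rows, assuming the same two-column table used for the completed-requests row shown earlier (the row labels are illustrative):

```python
table.add_row("TTFT std (ms)", f"{results.std_ttft_ms:.2f}")
table.add_row("TPOT std (ms)", f"{results.std_tpot_ms:.2f}")
```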
14 files reviewed, no comments
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/cloudai/workloads/vllm/report_generation_strategy.py`:
- Around line 64-65: The cache key issue comes from passing potentially
non-normalized Path objects into parse_vllm_bench_output from
can_handle_directory; update can_handle_directory to resolve the path (e.g.,
call self.test_run.output_path.resolve() or resolve() on the
VLLM_BENCH_JSON_FILE path) before passing it to parse_vllm_bench_output so the
cached key is consistent with generate_report and other callers, and likewise
ensure any other call sites (like generate_report) also resolve the path before
invoking parse_vllm_bench_output to avoid inconsistent cache hits/misses.
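A sketch of the call site described above; the method body is an assumption, and the point is only that every caller normalizes with `resolve()` before hitting the cached parser so all callers share one cache key:

```python
def can_handle_directory(self) -> bool:
    res_file = (self.test_run.output_path / VLLM_BENCH_JSON_FILE).resolve()
    return res_file.exists() and parse_vllm_bench_output(res_file) is not None
```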
14 files reviewed, 1 comment
14 files reviewed, 3 comments
```python
@property
def gpu_ids(self) -> list[int]:
    cuda_devices = str(self.test_run.test.extra_env_vars.get("CUDA_VISIBLE_DEVICES", ""))
    if (self.tdef.cmd_args.prefill and self.tdef.cmd_args.prefill.gpu_ids) and self.tdef.cmd_args.decode.gpu_ids:
        cuda_devices = f"{self.tdef.cmd_args.prefill.gpu_ids},{self.tdef.cmd_args.decode.gpu_ids}"
    if cuda_devices:
        return [int(gpu_id) for gpu_id in cuda_devices.split(",")]
    return list(range(self.system.gpus_per_node or 1))
```
Disagg GPU override ignored
In gpu_ids (line ~41), the override branch requires self.tdef.cmd_args.prefill and self.tdef.cmd_args.prefill.gpu_ids before using the explicit decode.gpu_ids. This means a user who sets only decode.gpu_ids (or only prefill.gpu_ids) won’t get their explicit selection reflected in gpu_ids, and disaggregated scripts will fall back to splitting CUDA_VISIBLE_DEVICES/system GPUs instead. If explicit prefill.gpu_ids or decode.gpu_ids is set, gpu_ids should be derived from whichever is provided (and validated) rather than requiring both.
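A sketch of the behaviour this comment asks for, reshaping the property shown above so that any explicit `prefill.gpu_ids`/`decode.gpu_ids` selection wins on its own; this reflects the reviewer's suggestion, not the current implementation, and validation of overlaps or counts is omitted:

```python
@property
def gpu_ids(self) -> list[int]:
    prefill_ids = self.tdef.cmd_args.prefill.gpu_ids if self.tdef.cmd_args.prefill else None
    decode_ids = self.tdef.cmd_args.decode.gpu_ids
    explicit = ",".join(ids for ids in (prefill_ids, decode_ids) if ids)
    if explicit:
        # Honour whichever explicit selection is present: prefill, decode, or both.
        return [int(gpu_id) for gpu_id in explicit.split(",")]
    cuda_devices = str(self.test_run.test.extra_env_vars.get("CUDA_VISIBLE_DEVICES", ""))
    if cuda_devices:
        return [int(gpu_id) for gpu_id in cuda_devices.split(",")]
    return list(range(self.system.gpus_per_node or 1))
```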
Here are options:
1. No `gpu_ids` set -> use either system or `CUDA_VISIBLE_DEVICES`.
2. IDs are set for both `decode` and `prefill` -> ignore env var and system settings and use explicit values from a user.
3. One of `decode.gpu_ids` or `prefill.gpu_ids` is set -> unclear what to use for the other part. Potentially, we can implement smart extraction based on input and values from (1). E.g. `decode.gpu_ids = "0,1"` and total GPUs set to 4 (or via env var), then `prefill.gpu_ids` should be equal to `"2,3"`.
Only (3) is not done yet. And this logic might be very fragile and lead to unexpected results. Current approach is also questionable, though.
@podkidyshev let's discuss pros and cons here and maybe delay the decision till we get some feedback from the users.
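For option (3), the complement-based derivation being discussed could look roughly like the sketch below; it is explicitly not implemented in the PR, and the fragility concerns above still apply (duplicate IDs, gaps, or IDs outside the total range would need validation):

```python
def complement_gpu_ids(specified: str, total_gpus: int) -> str:
    """Derive the unspecified half as the remaining GPU IDs.

    E.g. specified="0,1" with total_gpus=4 yields "2,3".
    """
    used = {int(gpu_id) for gpu_id in specified.split(",") if gpu_id}
    remaining = [str(gpu_id) for gpu_id in range(total_gpus) if gpu_id not in used]
    return ",".join(remaining)
```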
Summary
Two modes are supported at the moment, single node only: aggregated and disaggregated.
Test Plan
Additional Notes